

Olist is the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to sales channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to customers using Olist's logistics partners.
The dataset contains several .csv files with information on orders, customers, sellers, feedback, products, geographic location, etc. After a customer purchases a product from the Olist Store, the seller is notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer receives a satisfaction survey by email where they can rate the purchase experience and write down some comments. The dataset has been downloaded from Kaggle.
Nowadays data acts as a resource, and we often dig deeper into it to extract information, so EDA is a kind of resource extraction and value addition for our data. Beyond that, the other objective of EDA is to explore the unknown terrain of the data and gather useful insights from it.
Below are the outline and the sequential steps taken to dig into the dataset:
Following the data schema, merging the csv files into one dataframe and cleaning it with Pandas.
Referring to the individual columns of the final dataframe, framing questions to gather useful insights.
Among the variety of options for executing the code, from Jupyter Notebook, Kaggle and Binder to Google Colab, the code can be run with minimal or no changes at the click of a run button (after making sure all required packages are available).
Since the merged dataset contains over 117K rows and around 40 columns, choosing a platform that provides more RAM for computing is the better option for smoother execution.
Downloading the dataset from Kaggle, as well as committing the notebook, requires API keys, which can be obtained from the respective account credentials.
In the process of EDA we will be using multiple libraries from the Python ecosystem, such as numpy, pandas, seaborn, folium, wordcloud and plotly. So let's use pip to install the required packages.
!pip install plotly opendatasets --quiet
!pip install seaborn --quiet
!pip install folium --quiet
!pip install wordcloud --quiet
!pip install pandas numpy --quiet
!pip install geopy --quiet
!pip install jsonpickle --quiet
After installing the required libraries, we will import them and use them for the work ahead.
# used to download a dataset from kaggle
import opendatasets as od
# Data analysis library in python
import pandas as pd
# numerical computing library
import numpy as np
# relativedelta performs calendar-aware date arithmetic (e.g. time remaining between two dates)
from dateutil.relativedelta import relativedelta
# Python module named datetime used for working with dates as date objects
from datetime import datetime
# wordcloud tool can be used to generate clouds made up of large pool of words
from wordcloud import WordCloud
# for creating static visualizations in Python
import matplotlib.pyplot as plt
# a high-level interface for drawing informative statistical graphics.
import seaborn as sns
#for creating interactive visualizations
import plotly.express as px
#Plotting geospatial data with Folium
import folium
# Python standard library module for working with JSON data
import json
# jsonpickle is a library for the two-way conversion of complex Python objects and JSON
import jsonpickle
# module’s encoders and decoders preserve input and output order by default
from json import JSONEncoder
import plotly.data as pdata
# Show plots inline in the notebook instead of as pop-up windows
%matplotlib inline
import matplotlib
import multiprocessing, json
from matplotlib.gridspec import GridSpec
from matplotlib import rcParams
# Different module for making the different types of Folium graphs
from folium.plugins import FastMarkerCluster, Fullscreen, MiniMap, HeatMap, HeatMapWithTime, LocateControl
The dataset has been downloaded from Kaggle using the Kaggle account username and API key.
url='https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?select=olist_geolocation_dataset.csv'
od.download(url)
# Give kaggle account credential
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds Your Kaggle username: gautamkumarjha Your Kaggle Key: ········ Downloading brazilian-ecommerce.zip to ./brazilian-ecommerce
100%|██████████| 42.6M/42.6M [00:01<00:00, 40.9MB/s]
import jovian
jovian.commit()
[jovian] Updating notebook "jhagautamkumar362/eda-project-working" on https://jovian.ai [jovian] Committed successfully! https://jovian.ai/jhagautamkumar362/eda-project-working
'https://jovian.ai/jhagautamkumar362/eda-project-working'
df1=pd.read_csv('brazilian-ecommerce/olist_order_payments_dataset.csv',index_col= False)
df2=pd.read_csv('brazilian-ecommerce/olist_products_dataset.csv',index_col= False)
df3=pd.read_csv('brazilian-ecommerce/olist_order_reviews_dataset.csv',index_col= False)
df4=pd.read_csv('brazilian-ecommerce/olist_orders_dataset.csv',index_col= False)
df5=pd.read_csv('brazilian-ecommerce/olist_order_items_dataset.csv',index_col= False)
df6=pd.read_csv('brazilian-ecommerce/olist_sellers_dataset.csv',index_col= False)
df7=pd.read_csv('brazilian-ecommerce/olist_geolocation_dataset.csv',index_col=False)
df8=pd.read_csv('brazilian-ecommerce/olist_customers_dataset.csv',index_col= False)

The downloaded dataset contains 9 different csv files. So before performing EDA we should follow the schema of the dataset to merge all the csv files into one, which eases the process of performing the various pandas operations.
For example, the order_reviews, order_payments and orders datasets can be merged on the common order_id column, and similarly the other dataframes can be merged by following the schema.
Instead of working with bits and pieces of the data, it is much better to work with one dataframe, which can be done by combining all the dataframes into a single merged dataframe. Then we can perform our analysis.
df_merge=df2.merge(df5,on='product_id')
df_merge=df_merge.merge(df6,on='seller_id')
df_merge=df_merge.merge(df1,on='order_id')
df_merge=df_merge.merge(df3,on='order_id')
df_merge=df_merge.merge(df4,on='order_id')
df_merge=df_merge.merge(df8,on='customer_id')
# Converting the final merged dataframe into a single csv file
df_merge.to_csv('df_merge_1,2,3,4,5,6,8.csv')
df=pd.read_csv('df_merge_1,2,3,4,5,6,8.csv')
Let's extract a copy of the data into a new dataframe df_copy, which we can continue to modify without affecting the original dataframe.
df_copy=df.copy()
Before drawing useful insights from the dataset, we will check information such as the number of columns and rows, the data types, etc. These steps give us a fair estimate of the dataset on which we build our EDA.
Let's view the dataset's shape, size and columns.
n_rows,n_column=df_copy.shape
n_column
40
df_copy.shape
(117329, 39)
Since not all columns of the dataset have equal importance for the analysis to be carried out, it is better to drop the columns that are not required.
# Re running of only this cell will throw error because the columns has been already been deleted
df_copy = df_copy.drop('Unnamed: 0',axis=1)
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117329 entries, 0 to 117328
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   product_id                     117329 non-null  object
 1   product_category_name          115634 non-null  object
 2   product_name_lenght            115634 non-null  float64
 3   product_description_lenght     115634 non-null  float64
 4   product_photos_qty             115634 non-null  float64
 5   product_weight_g               117309 non-null  float64
 6   product_length_cm              117309 non-null  float64
 7   product_height_cm              117309 non-null  float64
 8   product_width_cm               117309 non-null  float64
 9   order_id                       117329 non-null  object
 10  order_item_id                  117329 non-null  int64
 11  seller_id                      117329 non-null  object
 12  shipping_limit_date            117329 non-null  object
 13  price                          117329 non-null  float64
 14  freight_value                  117329 non-null  float64
 15  seller_zip_code_prefix         117329 non-null  int64
 16  seller_city                    117329 non-null  object
 17  seller_state                   117329 non-null  object
 18  payment_sequential             117329 non-null  int64
 19  payment_type                   117329 non-null  object
 20  payment_installments           117329 non-null  int64
 21  payment_value                  117329 non-null  float64
 22  review_id                      117329 non-null  object
 23  review_score                   117329 non-null  int64
 24  review_comment_title           13892 non-null   object
 25  review_comment_message         49679 non-null   object
 26  review_creation_date           117329 non-null  object
 27  review_answer_timestamp        117329 non-null  object
 28  customer_id                    117329 non-null  object
 29  order_status                   117329 non-null  object
 30  order_purchase_timestamp       117329 non-null  object
 31  order_approved_at              117314 non-null  object
 32  order_delivered_carrier_date   116094 non-null  object
 33  order_delivered_customer_date  114858 non-null  object
 34  order_estimated_delivery_date  117329 non-null  object
 35  customer_unique_id             117329 non-null  object
 36  customer_zip_code_prefix       117329 non-null  int64
 37  customer_city                  117329 non-null  object
 38  customer_state                 117329 non-null  object
dtypes: float64(10), int64(6), object(23)
memory usage: 34.9+ MB
The final dataframe contains 10 float, 6 int and 23 object dtype columns. The info() output also shows that several columns (such as product_category_name, review_comment_title and order_delivered_customer_date) contain null values.
We'll convert the object type date columns into datetime type using pd.to_datetime.
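As a minimal sketch of that conversion (on a toy series rather than the actual dataframe column):

```python
import pandas as pd

# Toy series in the same string format as order_purchase_timestamp;
# the real call would be on the df_copy column instead
ts = pd.Series(['2018-04-24 16:16:53', '2017-01-27 12:10:46'])
ts_dt = pd.to_datetime(ts)

# The dtype is now datetime64[ns], so the .dt accessors become available
print(ts_dt.dt.year.tolist())  # [2018, 2017]
```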
Let's use the pandas describe method to get statistical summaries of the numerical columns.
df_copy.describe().style.background_gradient(cmap='Greens',axis=None)
| product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | order_item_id | price | freight_value | seller_zip_code_prefix | payment_sequential | payment_installments | payment_value | review_score | customer_zip_code_prefix | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 115634.000000 | 115634.000000 | 115634.000000 | 117309.000000 | 117309.000000 | 117309.000000 | 117309.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 | 117329.000000 |
| mean | 48.768018 | 785.802861 | 2.205528 | 2110.763062 | 30.254456 | 16.612476 | 23.071452 | 1.194121 | 120.524349 | 20.027514 | 24450.781955 | 1.094452 | 2.940151 | 172.062565 | 4.031467 | 35060.118112 |
| std | 10.033831 | 652.382965 | 1.717783 | 3785.128931 | 16.177519 | 13.452625 | 11.745875 | 0.684225 | 182.944843 | 15.828077 | 27582.364358 | 0.731174 | 2.775370 | 265.388194 | 1.387927 | 29849.496175 |
| min | 5.000000 | 4.000000 | 1.000000 | 0.000000 | 7.000000 | 2.000000 | 6.000000 | 1.000000 | 0.850000 | 0.000000 | 1001.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1003.000000 |
| 25% | 42.000000 | 346.000000 | 1.000000 | 300.000000 | 18.000000 | 8.000000 | 15.000000 | 1.000000 | 39.900000 | 13.080000 | 6429.000000 | 1.000000 | 1.000000 | 60.750000 | 4.000000 | 11250.000000 |
| 50% | 52.000000 | 600.000000 | 1.000000 | 700.000000 | 25.000000 | 13.000000 | 20.000000 | 1.000000 | 74.900000 | 16.280000 | 13660.000000 | 1.000000 | 2.000000 | 108.100000 | 5.000000 | 24240.000000 |
| 75% | 57.000000 | 983.000000 | 3.000000 | 1800.000000 | 38.000000 | 20.000000 | 30.000000 | 1.000000 | 134.900000 | 21.180000 | 28035.000000 | 1.000000 | 4.000000 | 189.060000 | 5.000000 | 58770.000000 |
| max | 76.000000 | 3992.000000 | 20.000000 | 40425.000000 | 105.000000 | 105.000000 | 118.000000 | 21.000000 | 6735.000000 | 409.680000 | 99730.000000 | 29.000000 | 24.000000 | 13664.080000 | 5.000000 | 99990.000000 |
df_copy.tail().style.background_gradient(cmap='Greens',axis=0)
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | order_id | order_item_id | seller_id | shipping_limit_date | price | freight_value | seller_zip_code_prefix | seller_city | seller_state | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117324 | daf9915404caf75fa1717a436b46fa14 | moveis_decoracao | 63.000000 | 379.000000 | 5.000000 | 1400.000000 | 33.000000 | 15.000000 | 30.000000 | 5d36e8ca3ccd1262b7641467fdba0b8e | 1 | 0f519b0d2e5eb2227c93dd25038bfc01 | 2017-01-31 12:10:46 | 79.900000 | 23.000000 | 14940 | ibitinga | SP | 1 | credit_card | 1 | 102.900000 | e587cf1954fc358343a8c510c8694269 | 5 | nan | Muito obrigada...vcs foram pontuais e uma loja com grande responsabilidade... | 2017-02-10 00:00:00 | 2017-02-11 13:30:56 | 3b2b96751131c948164edc5d21d27abb | delivered | 2017-01-27 12:10:46 | 2017-01-27 12:31:29 | 2017-01-30 10:26:15 | 2017-02-08 18:42:20 | 2017-03-20 00:00:00 | 55b3a0a9998356016bbb072bf8c188a6 | 60830 | fortaleza | CE |
| 117325 | b0498e44190727b728ae4490f2e9b6a5 | malas_acessorios | 25.000000 | 712.000000 | 4.000000 | 5250.000000 | 34.000000 | 23.000000 | 40.000000 | 8edaa376e19d08bc84ab8845682216b4 | 1 | dda37071807e404c5bb2a1590c66326f | 2018-04-30 09:30:37 | 199.990000 | 19.280000 | 3282 | sao paulo | SP | 1 | credit_card | 1 | 219.270000 | 7bbe7636141510158aa577dcfd017b13 | 5 | nan | nan | 2018-05-11 00:00:00 | 2018-05-13 00:12:43 | 4c1e29ec2ed2feac441cf24b25262ed2 | delivered | 2018-04-24 08:37:20 | 2018-04-24 17:26:25 | 2018-04-25 12:24:00 | 2018-05-10 22:36:37 | 2018-05-14 00:00:00 | 913a4e0cb7fe555e6cffa875ecd58e2e | 28455 | sao jose de uba | RJ |
| 117326 | 80b391b2dc6c958ef3ad34fa7ee01423 | automotivo | 17.000000 | 306.000000 | 4.000000 | 200.000000 | 26.000000 | 4.000000 | 13.000000 | 635c8d3ab47b7448b5d6c81dc9b7526e | 1 | 3aa3c89ae3cd482385568be76120f63c | 2018-06-14 02:57:41 | 29.900000 | 19.040000 | 74343 | goiania | GO | 1 | boleto | 1 | 48.940000 | 0a954efcadecd2bfa871c46f5fd05308 | 4 | nan | muito bom | 2018-06-19 00:00:00 | 2018-06-20 01:29:30 | d7b2339f118be9c6d98389f6d0c370d4 | delivered | 2018-06-07 22:36:29 | 2018-06-09 02:57:41 | 2018-06-14 13:59:00 | 2018-06-18 18:33:50 | 2018-06-26 00:00:00 | 0c2bb0bab2bf22858fa762d320a326bb | 3729 | sao paulo | SP |
| 117327 | 2b6535d32c6996c9478c131a8ff17a05 | construcao_ferramentas_seguranca | 50.000000 | 428.000000 | 2.000000 | 333.000000 | 16.000000 | 9.000000 | 10.000000 | b3e5d245a95dd6378330cc2249d9d6b4 | 1 | e64882b4ef12aee8d0faca4db5b681f4 | 2018-01-18 02:06:32 | 115.000000 | 21.610000 | 74210 | goiania | GO | 1 | boleto | 1 | 273.220000 | 9802486c19789491798384927fd28c30 | 4 | nan | nan | 2018-02-07 00:00:00 | 2018-02-07 20:09:14 | e08e569cab13d0c5d8b0899056edf5af | delivered | 2018-01-12 15:28:49 | 2018-01-13 02:06:32 | 2018-01-16 18:19:06 | 2018-02-06 15:41:31 | 2018-02-08 00:00:00 | 3ac652bdb8a0de18c0a73ce11f88efed | 88075 | florianopolis | SC |
| 117328 | 2b6535d32c6996c9478c131a8ff17a05 | construcao_ferramentas_seguranca | 50.000000 | 428.000000 | 2.000000 | 333.000000 | 16.000000 | 9.000000 | 10.000000 | b3e5d245a95dd6378330cc2249d9d6b4 | 2 | e64882b4ef12aee8d0faca4db5b681f4 | 2018-01-18 02:06:32 | 115.000000 | 21.610000 | 74210 | goiania | GO | 1 | boleto | 1 | 273.220000 | 9802486c19789491798384927fd28c30 | 4 | nan | nan | 2018-02-07 00:00:00 | 2018-02-07 20:09:14 | e08e569cab13d0c5d8b0899056edf5af | delivered | 2018-01-12 15:28:49 | 2018-01-13 02:06:32 | 2018-01-16 18:19:06 | 2018-02-06 15:41:31 | 2018-02-08 00:00:00 | 3ac652bdb8a0de18c0a73ce11f88efed | 88075 | florianopolis | SC |
By looking at the individual columns we can frame open-ended questions, which are the backbone of EDA, and we will try to answer those questions either with numeric summaries or with visualizations.
Let's set the chart style, font size, figure size and figure facecolour using matplotlib. These settings apply only to matplotlib and seaborn charts.
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Among the different columns of the dataset, the price column gives an estimate of the amount of money spent by customers on their purchases. So let's explore this column:
print('Min money spent by a customer is {} R$'.format(df_copy.price.min()))
print('\nHighest money spent by a customer is {} R$'.format(df_copy.price.max()))
print('\nAvg money spent by a customer is {} R$'.format(df_copy.price.mean()))
Min money spent by a customer is 0.85 R$
Highest money spent by a customer is 6735.0 R$
Avg money spent by a customer is 120.52434879709405 R$
Exploring the payment type column tells us which mode of payment is most favoured by customers, and which modes follow. So let's explore this column.
# Checking the different values contained in the column "payment_type"
df_copy.payment_type.unique()
array(['debit_card', 'credit_card', 'boleto', 'voucher'], dtype=object)
# Getting the number for each category of payment being done
df_copy.payment_type.value_counts()
credit_card    86520
boleto         22835
voucher         6282
debit_card      1692
Name: payment_type, dtype: int64
Now we want to separate the labels and counts, which is useful for making a pie chart. So let's convert the value counts from a Series to a dictionary, so that we can access the keys and values separately.
payment_dict=df_copy.payment_type.value_counts().to_dict()
payment_dict
{'credit_card': 86520, 'boleto': 22835, 'voucher': 6282, 'debit_card': 1692}
list(payment_dict.values())
[86520, 22835, 6282, 1692]
list(payment_dict.keys())
['credit_card', 'boleto', 'voucher', 'debit_card']
With the help of the above information we can make a pie chart.
fig = px.pie(values=list(payment_dict.values()),
             names=list(payment_dict.keys()),
             color=list(payment_dict.keys()),  # needed for color_discrete_map to apply
             color_discrete_map={'credit_card': 'lightcyan',
                                 'debit_card': 'cyan',
                                 'voucher': 'royalblue',
                                 'boleto': 'darkblue'},
             hole=.4,
             title='<b>Mode of payment used by customers at Olist stores</b>',
             template="none"
             )
fig.show()
The inferences I can draw from this pie chart are as follows:
The chart reflects that the majority of customers like to pay via credit card, whereas a very small share pay via debit card. This implies that the e-commerce stores need to keep their card payment infrastructure in good condition, since failures could hamper the stores' revenue generation as well as inconvenience customers. If the stores want to receive more payments via debit card, they may need offers, coupons or some other kind of incentive to encourage that mode of payment.
For the overall experience, in exchange for the time spent and the goods and services purchased from the Olist stores, a customer gives a review score out of 5. So let's deep dive to see what percentage of customers give review scores of 5, 4, 3, 2 and 1.
x_1,y_1=df_copy[df_copy.review_score== 1].shape
x_2,y_2=df_copy[df_copy.review_score== 2].shape
x_3,y_3=df_copy[df_copy.review_score== 3].shape
x_4,y_4=df_copy[df_copy.review_score== 4].shape
x_5,y_5=df_copy[df_copy.review_score== 5].shape
print('{}% of Product got 1 score rating'.format(x_1*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 2 score rating'.format(x_2*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 3 score rating'.format(x_3*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 4 score rating'.format(x_4*100/(x_1+x_2+x_3+x_4+x_5)))
print('\n{}% of Product got 5 score rating'.format(x_5*100/(x_1+x_2+x_3+x_4+x_5)))
12.773205331300547% of Product got 1 score rating
3.456531319604083% of Product got 2 score rating
8.406684374143907% of Product got 3 score rating
19.118119989994877% of Product got 4 score rating
56.245458984956585% of Product got 5 score rating
From this we can infer that, out of all products, the share receiving a 5.0 review score is larger than that of any other review score.
Among all the customers who spent money purchasing goods and services at Olist stores, 56.2% are very happy with the kind of service they got, whereas a considerable 12.77% are unhappy with the service they received at the Olist stores.
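As an aside, the same percentages can be computed in one line with value_counts(normalize=True); a minimal sketch on a toy column (the real call would be on df_copy.review_score):

```python
import pandas as pd

# Toy review_score column; the real call would be df_copy.review_score
scores = pd.Series([5, 5, 5, 5, 4, 4, 3, 1])
# normalize=True returns fractions; multiply by 100 for percentages
pct = scores.value_counts(normalize=True).mul(100)
print(round(pct[5], 1), round(pct[1], 1))  # 50.0 12.5
```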
From the mode of payment column we learned that most customers pay via credit card. For the total amount paid at Olist stores, instead of paying a bulk amount, customers can pay in smaller chunks periodically over a period of time; based on their financial liabilities they opt for the required number of installments to clear their dues. Hence it could be informative to explore the payment_installments column.
fig = px.histogram(df_copy, x="payment_installments",nbins=18,
title='<b>Number of installments for repaying bills</b>')
fig.update_traces(xbins=dict( # bins used for histogram
start=1,
end=25.0,
size=2
))
fig.show()
The inferences we can draw from the above plot are:
5490 customers repay their dues in 9-10 installments.
This reflects that the major chunk of customers clear their dues by paying very few installments, which suggests they are on a good financial footing. In comparison, relatively few customers (the 5490 above) repay their dues in 9-10 installments, taking almost a year to clear them, which might be due to various financial constraints. Either way, for the amounts customers take on credit, the financial institutions are able to recover their charges within a maximum of about one year.
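The 9-10 installment figure quoted above can be re-checked with a boolean mask; a minimal sketch on toy data (the real call would use df_copy.payment_installments):

```python
import pandas as pd

# Toy payment_installments column; the real call would use
# df_copy.payment_installments
installments = pd.Series([1, 1, 2, 10, 9, 3, 10])
n_9_10 = int(installments.between(9, 10).sum())  # between() is inclusive
print(n_9_10)  # 3
```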
After getting the scores for the products purchased, it is necessary to check what kind of review messages or comments customers give. Instead of looking at individual feedback messages, let's create a wordcloud, a visual representation of words, to see the most frequent terms in customers' comments.
So let's create the cloud of words from the respective columns and see the feedback messages in a pictorial way.
words = ' '.join((i for i in df_copy.review_comment_title.dropna().str.replace(';', ' ').str.replace(',', ' ')))
wc = WordCloud( width=800, height=400, background_color='black',contour_color='#023075',contour_width=3,colormap='rainbow').generate(words)
plt.figure(figsize=(16,8))
plt.title('Customers feedbacks')
plt.axis("off")
plt.grid(False)
plt.imshow(wc);
words = ' '.join((i for i in df_copy.review_comment_message.dropna().str.replace(';', ' ').str.replace(',', ' ')))
wc = WordCloud( width=800, height=400, background_color='black',contour_color='#023075',contour_width=3,colormap='rainbow').generate(words)
plt.figure(figsize=(16,8))
plt.axis("off")
plt.grid(False)
plt.imshow(wc);
The words in the wordcloud are not in English (the reviews are written in Portuguese), so some translation work may be required to read them in plain English.
Among the words appearing in the cloud are super, Excelente, Recomendo, bom and muito; all of these represent satisfied customers who would also recommend purchasing goods from the Olist stores. This aligns well with the 5-score reviews, which constitute about 56% of the total.
There also exist a few words, like Intriguer, which show a bad experience for some of the customers; once identified, those customers can be reached out to, their complaints addressed and remedies sought.
Often there exists a time lag between when an order is placed and when that order is accepted for further processing (packaging, shipping, logistics assignment, etc.). If the time lag is nominal, the user experience of purchasing from the Olist stores is not hampered much; the longer the lag, the longer the delay before the product finally reaches the customer, and the longer the customer waits.
Hence, to make sure Olist customers remain untouched by this problem, it is recommended to keep an eye on the factors driving this time lag.
# Checking the values present inside the required column of the dataframe before performing arithmetic manipulation
df_copy.order_purchase_timestamp
0 2018-04-24 16:16:53
1 2018-07-04 18:39:28
2 2018-04-23 19:03:35
3 2018-04-23 19:03:35
4 2018-07-29 20:50:22
...
117324 2017-01-27 12:10:46
117325 2018-04-24 08:37:20
117326 2018-06-07 22:36:29
117327 2018-01-12 15:28:49
117328 2018-01-12 15:28:49
Name: order_purchase_timestamp, Length: 117329, dtype: object
Before any kind of arithmetic operation on columns containing date and time information, we first convert the required columns to timestamps.
# Created extra columns in the dataframe for converting the particular columns into datetime format data
df_copy['customer_ordered_datetime']=pd.to_datetime(df_copy.order_purchase_timestamp)
df_copy['order_approved_time']=pd.to_datetime(df_copy.order_approved_at)
df_copy['order_approved_time'].dt.time
0 19:04:19
1 16:33:14
2 19:17:45
3 19:17:45
4 18:31:13
...
117324 12:31:29
117325 17:26:25
117326 02:57:41
117327 02:06:32
117328 02:06:32
Name: order_approved_time, Length: 117329, dtype: object
df_copy['order_day'] = df_copy['customer_ordered_datetime'].dt.day_of_week
df_copy['order_month'] = df_copy['customer_ordered_datetime'].dt.month_name()
df_copy['order_year'] = df_copy['customer_ordered_datetime'].dt.year
df_copy['order_time'] = df_copy['customer_ordered_datetime'].dt.time
df_copy['Order_month_year'] =df_copy['customer_ordered_datetime'].dt.to_period('M')
After converting all of the required columns into timestamps, we take the difference of the two timestamps and then convert the result into hours.
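A minimal sketch of that computation, on a toy frame with the same column names as the notebook (the real code would operate on df_copy):

```python
import pandas as pd

# Toy frame mirroring the two timestamp columns used in the notebook;
# the real computation would run on df_copy
df = pd.DataFrame({
    'order_purchase_timestamp': ['2018-04-24 16:16:53', '2017-01-27 12:10:46'],
    'order_approved_at':        ['2018-04-24 19:04:19', '2017-01-27 12:31:29'],
})
purchase = pd.to_datetime(df['order_purchase_timestamp'])
approved = pd.to_datetime(df['order_approved_at'])

# Difference of the two timestamps, expressed in hours
lag_hours = (approved - purchase).dt.total_seconds() / 3600
print(lag_hours.round(2).tolist())  # [2.79, 0.35]
```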
The inferences we can draw from this plot are:
On the other hand, a negligible proportion of customers experienced a time lag of 2 months, which could have resulted from the non-availability of products, geographical challenges, etc.
The general trend of the graph tells the story that the majority of customers get immediate confirmation after placing their order, which indicates good coordination between the different departments of Olist. The handful of delays could stem from anything between a lack of product and a failure to find the respective logistics partner.
After getting an estimate of the time lag, it is natural to check whether the products ordered by customers have finally reached their destination or not. After all, a product should reach its customer with minimal hurdles and delay.
So let's explore the column which gives the required information.
orderStatus_dict=df_copy.order_status.value_counts().to_dict()
fig = px.pie(df_copy.order_status.value_counts()
, values=list(orderStatus_dict.values()), names= list(orderStatus_dict.keys()), color= list(orderStatus_dict.keys()),
title='<b>Customers order delivery status</b>',
color_discrete_map={'delivered':'darkcyan',
'shipped':'cyan',
'canceled':'royalblue',
'invoiced':'darkblue',
'processing':'green',
'unavailable':'yellow',
'iapproved':'red'}
)
fig.show()
The figure shown above is self-explanatory in terms of the numbers and information it conveys.
Apart from this, before getting delivered to its final destination, a product passes through different stages like invoiced, unavailable, shipped, processing, etc., and the contribution from each of these individual categories is less than 0.5%.
So we can say that the overall management of ensuring product delivery is good, whereas the roughly 3% of products that are not getting delivered deserve a closer look to investigate the reason behind the delay.
As different products are supplied by different sellers coming from different geographical locations, it is useful to know the number of sellers per city. This can help in maintaining stock in different locations, which can reduce transportation time and cost for the customers as well as the company.
So let's deep dive into this column.
sellers_dict=df_copy.seller_city.value_counts().head(30).to_dict()
fig = px.bar(df_copy.seller_city.value_counts(), x=list(sellers_dict.keys()), y=list(sellers_dict.values()),
color=list(sellers_dict.keys()),
title='Top 30 city of Brazil having maximum number of sellers:')
fig.show()
After getting several details including methods of payment, feedback and seller locations, it is important to investigate the different categories of products being sold by the Olist e-commerce stores. Doing this gives a fair estimate of customers' choices and preferences when ordering products to their taste.
While traversing the column names of the dataframe, the column product_category_name contains the details of the different product categories. Beyond just getting the category names, our attempt should also be to explore, for example, which product categories are ordered most when the mode of payment is credit card. So let's give this column a visual representation, and jump straight into its contents.
df_copy.product_category_name.value_counts().head(15)
cama_mesa_banho           11847
beleza_saude               9944
esporte_lazer              8942
moveis_decoracao           8743
informatica_acessorios     8105
utilidades_domesticas      7331
relogios_presentes         6161
telefonia                  4692
ferramentas_jardim         4558
automotivo                 4356
brinquedos                 4246
cool_stuff                 3964
perfumaria                 3575
bebes                      3178
eletronicos                2827
Name: product_category_name, dtype: int64
%%time
# Check execution time of code-cell
product_cat_dict=df_copy.product_category_name.value_counts().nlargest(30).to_dict()
fig = px.bar(df_copy.product_category_name.value_counts(),
x=list(product_cat_dict.values()),
y=list(product_cat_dict.keys()),
#hover_data=['lifeExp', 'gdpPercap'],
color=product_cat_dict,
title="<b>Top 30 product category purchased by customers </b>",
orientation = 'h',
)
fig.update_layout(xaxis_title='Count of Product category',
yaxis_title='Product category')
fig.show()
CPU times: user 55.2 ms, sys: 1.99 ms, total: 57.2 ms Wall time: 53.7 ms
# Segregating the required columns of dataframe
df_product_photos_qty=df_copy.groupby('product_category_name').product_photos_qty
df_product_photos_qty.value_counts()
product_category_name product_photos_qty
agro_industria_e_comercio 1.0 91
2.0 50
3.0 23
6.0 9
8.0 9
..
utilidades_domesticas 8.0 39
10.0 24
9.0 9
12.0 2
13.0 1
Name: product_photos_qty, Length: 520, dtype: int64
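The long listing above can be condensed into one number per category. As a minimal sketch (on a toy frame standing in for df_copy, with invented quantities), the mean photo count per category summarizes the same grouping:

```python
import pandas as pd

# Toy stand-in for df_copy: real column names, invented values.
toy = pd.DataFrame({
    "product_category_name": ["perfumaria", "perfumaria",
                              "brinquedos", "brinquedos"],
    "product_photos_qty": [1.0, 3.0, 2.0, 4.0],
})

# One summary number per category instead of a full distribution.
mean_photos = toy.groupby("product_category_name")["product_photos_qty"].mean()
```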
product_length_cm, product_height_cm, product_width_cm
After covering product details such as category, seller, and seller zip code, we also need the volume of each product, a property that is crucial for shipping via logistics partners: carriers are assigned and transport costs are calculated based on product volume. Hence it is crucial information as far as products and their dimensions are concerned.
The dimensions are given by the product's length, width, and height, so we can multiply these three columns, then group the products by category to estimate the typical volume per category.
df_copy['Product_volume']=(df_copy.product_length_cm * df_copy.product_height_cm * df_copy.product_width_cm)/1000
df_copy['Product_volume'].sample()
98541 6.4 Name: Product_volume, dtype: float64
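The per-category grouping described above can be sketched end to end on a toy frame (column names mirror the dataset; the dimensions themselves are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "product_category_name": ["perfumaria", "perfumaria", "brinquedos"],
    "product_length_cm": [16.0, 20.0, 50.0],
    "product_height_cm": [10.0, 25.0, 10.0],
    "product_width_cm": [14.0, 20.0, 25.0],
})

# Volume in litres (cm^3 / 1000), same formula as the cell above.
toy["Product_volume"] = (toy.product_length_cm
                         * toy.product_height_cm
                         * toy.product_width_cm) / 1000

# Mean volume per category gives the per-category estimate.
vol_by_cat = toy.groupby("product_category_name")["Product_volume"].mean()
```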
Having explored every column of the dataset in some depth, let's use the analysis above to answer some pointed questions:
The answer to this question comes from relating one column, product_category_name, to another, payment_type:
Combining the information drawn from both columns gives the desired answer:
%%time
# Filtering the mode of payment as "credit_card"
df_copy_credit_card=df_copy[df_copy.payment_type=='credit_card']
# Using the filtered data, take the top 20 product categories
credit_card_Payment_product_cat = df_copy_credit_card.product_category_name.value_counts().nlargest(20).to_dict()
line_colors = ["#0a2e36", "#27FB6B", "#14cc60", "#036d19", "#09a129"]
fig = px.treemap(path=[list(credit_card_Payment_product_cat.keys())],
                 values=list(credit_card_Payment_product_cat.values()),
                 width=1000, height=850,
                 title="<b>Category of product purchased by credit card</b>",
                 color=list(credit_card_Payment_product_cat.keys()),
                 color_discrete_sequence=line_colors,
                 template="none",
                 )
fig.show()
CPU times: user 221 ms, sys: 29.3 ms, total: 250 ms Wall time: 245 ms
Let's interpret the result of this treemap: the plot shows the top 20 product categories purchased via credit card, i.e. which products the most popular payment method was used for.
Each rectangle in the graph represents one product category (20 categories, hence 20 rectangles), and its area is proportional to the category's order count: rectangles with greater length and breadth mean larger counts, while visibly smaller rectangles mean smaller counts.
Following this interpretation, cama_mesa_banho has the largest count, followed by beleza_saude and esporte_lazer.
This information also helps categorize products for logistics: high-count categories may need different logistics partners for smoother delivery as well as bigger storage space.
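The treemap reading (rectangle area proportional to order count) can be double-checked numerically with value_counts(normalize=True); a small sketch with real category names but invented counts:

```python
import pandas as pd

# Invented counts: 5 cama_mesa_banho, 3 beleza_saude, 2 telefonia orders.
cats = pd.Series(["cama_mesa_banho"] * 5
                 + ["beleza_saude"] * 3
                 + ["telefonia"] * 2)

# Each category's share of all orders; shares sum to 1.
share = cats.value_counts(normalize=True)
```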
The answer to this question comes from combining two columns of the data frame df_copy: product_category_name and Product_volume.
First we filter out the rows whose volume exceeds the mean, then count the product categories among them.
df_great_vol=df_copy[df_copy['Product_volume']> df_copy['Product_volume'].mean()]
# Droping the na values from particular columns of the dataframes
df_great_vol=df_great_vol.dropna(axis=0,subset=['product_category_name'])
df_great_vol=df_great_vol.dropna(axis=0,subset=['order_status'])
# Rechecking there exist any na values
df_great_vol['product_category_name'].isna().sum()
0
# Rechecking there exist any na values
df_great_vol['order_status'].isna().sum()
0
great_vol_dict=df_great_vol.product_category_name.value_counts().nlargest(15).to_dict()
fig = px.pie(values=list(great_vol_dict.values()),
             names=list(great_vol_dict.keys()),
             color=list(great_vol_dict.keys()),
             title="<b>Categories of product whose volume is greater than the mean volume</b>",
             hole=0.4
             )
fig.update_traces(textposition='inside')
fig.show()
Looking at this graph, the circle is divided into slices, one per product category. The wider a slice, the larger that category's share of above-mean-volume products, and each slice's color identifies its category, with the counts shown on the legend. Combining this information, we can infer the following.
Hence we can say that the categories cama_mesa_banho, utilidades_domesticas, and moveis_decoracao, followed by casa_conforto, have the largest counts, which signals that larger vehicles are needed for delivering these product categories.
Answering this question involves exploring columns such as shipping_limit_date, price, freight_value, order_status, order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, and order_estimated_delivery_date. Several pieces of information (day, month, year, timestamp, and other details) have to be extracted at a finer level.
%%time
df_1=df_copy.sort_values(by=['payment_value'],ascending=False)
sns.set_style("darkgrid")
sns.set(rc={"figure.figsize":(15, 8)})
sns.barplot(data=df_copy, x='order_month', y='payment_value',hue='order_year')
plt.xlabel("Ordered Month" , size = 12)
plt.ylabel("Payment_Amount" , size = 12)
plt.title(" Order_month vs Customers payment_value for different years" , size = 24)
CPU times: user 1.2 s, sys: 0 ns, total: 1.2 s Wall time: 1.2 s
Text(0.5, 1.0, ' Order_month vs Customers payment_value for different years')
Inferences drawn from this plot can be laid out in tabular form for better understanding:

We can say that in 2018 the Olist store generated steady revenue throughout the year, whereas in 2016 revenue generation was erratic; 2017 followed a steady pattern similar to 2018.
Knowing which month of the year was most beneficial for the Olist store, i.e. when the maximum number of orders was received, plays a crucial role in understanding the financial year.
So we extract the month and year as a concatenated key and take its counts as the number of orders received at the e-commerce store.
df_copy['Order_month_year'] = df_copy['customer_ordered_datetime'].dt.year.astype(str) + "-" + df_copy['customer_ordered_datetime'].dt.month.astype(str)
order_month_dict = df_copy['Order_month_year'].value_counts().to_dict()
fig = px.bar(x=list(order_month_dict.keys()),
             y=list(order_month_dict.values())
             )
fig.update_layout(
title="<b>Customer order month of particular year wise</b>",
xaxis_title="Month of year",
yaxis_title="No.of orders",
font=dict(family='Arial',
size=14,
)
)
fig.show()
The story this bar chart narrates: October 2016 brought the Olist store the fewest orders (around 5K by this count), whereas October-November 2017 brought the most, at around 9K. Orders in the other months of the subsequent years stayed roughly the same, with only minimal differences between them, i.e. across 2018 the Olist store received approximately the same number of orders each month.
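The year-month key built above by string concatenation sorts lexically ("2018-10" before "2018-2"); an alternative sketch using pandas periods, whose zero-padded string form keeps chronological order, on toy timestamps:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2016-10-04", "2017-11-20", "2017-11-03"]))

# Zero-padded "YYYY-MM" keys, so string sorting matches time order.
month_year = ts.dt.to_period("M").astype(str)
counts = month_year.value_counts()
```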
The Olist store's customers are spread across Brazil's different geographic locations, but different locations may not contribute the same number of orders. This plot analyzes how many orders the Olist store receives from each geo-location.
However, mapping directly from an address or zip code does not immediately give a geographic position, so the first step is to obtain the latitude and longitude corresponding to Brazil's different locations.
import json
import requests

r = requests.get('https://servicodados.ibge.gov.br/api/v1/localidades/mesorregioes')
content = [c['UF'] for c in json.loads(r.text)]
br_info = pd.DataFrame(content)
br_info['nome_regiao'] = br_info['regiao'].apply(lambda x: x['nome'])
br_info.drop('regiao', axis=1, inplace=True)
br_info.drop_duplicates(inplace=True)
# Brazil's northernmost point is at 5°16′27.8″ N latitude.
olist_geolocation = df7
geo_prep = olist_geolocation[olist_geolocation.geolocation_lat <= 5.27438888]
# Its westernmost point is at 73°58′58.19″ W longitude.
geo_prep = geo_prep[geo_prep.geolocation_lng >= -73.98283055]
# Its southernmost point is at 33°45′04.21″ S latitude.
geo_prep = geo_prep[geo_prep.geolocation_lat >= -33.75116944]
# Its easternmost point is at 34°47′35.33″ W longitude.
geo_prep = geo_prep[geo_prep.geolocation_lng <= -34.79314722]
geo_group = geo_prep.groupby(by='geolocation_zip_code_prefix', as_index=False).min()
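The four chained filters above can be collapsed with Series.between; a sketch on toy coordinates (one point in Sao Paulo, one in Madrid that should be dropped, one in Brasilia), using the same bounding-box constants:

```python
import pandas as pd

geo = pd.DataFrame({
    "geolocation_lat": [-23.55, 40.42, -15.83],
    "geolocation_lng": [-46.63, -3.70, -47.92],
})

# Brazil's approximate bounding box, same constants as above.
in_brazil = (geo.geolocation_lat.between(-33.75116944, 5.27438888)
             & geo.geolocation_lng.between(-73.98283055, -34.79314722))
filtered = geo[in_brazil]
```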
After successfully filtering to valid Brazilian coordinates, we grab the necessary columns and merge br_info and geo_group into the main frame so that each order is mapped to a latitude and longitude.
# Merging all the informations
df_orders_items = df_copy.merge(br_info, how='left', left_on='customer_state', right_on='sigla')
df_orders_items = df_orders_items.merge(geo_group, how='left', left_on='customer_zip_code_prefix',
right_on='geolocation_zip_code_prefix')
df_orders_items.head().style.background_gradient(cmap='ocean',axis=None)
| product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm | order_id | order_item_id | seller_id | shipping_limit_date | price | freight_value | seller_zip_code_prefix | seller_city | seller_state | payment_sequential | payment_type | payment_installments | payment_value | review_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | customer_ordered_datetime | order_approved_time | order_day | order_month | order_year | order_time | Order_month_year | Product_volume | id | sigla | nome | nome_regiao | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.000000 | 287.000000 | 1.000000 | 225.000000 | 16.000000 | 10.000000 | 14.000000 | e17e4f88e31525f7deef66779844ddce | 1 | 5670f4db5b62c43d542e1b2d56b0cf7c | 2018-04-30 17:33:54 | 10.910000 | 7.390000 | 3694 | sao paulo | SP | 1 | debit_card | 1 | 18.300000 | d71da8fd8c6e3adef26be965f065b8a1 | 5 | Vale a pena | Super adooorei o delineador ele é bem preto e eu smp usei o da Mac e serio o da belle angel me surpreendeu pq eu achei que por ser barato iria ser ruim mas NAO é ... o unico problema é q ele é pequeno | 2018-04-28 00:00:00 | 2018-05-04 00:55:02 | f8a3e963a310aa58b60a5b1fed5bceb5 | delivered | 2018-04-24 16:16:53 | 2018-04-24 19:04:19 | 2018-04-25 17:01:00 | 2018-04-27 16:42:17 | 2018-05-09 00:00:00 | b1a1199364a4a7fe27c4486ab63f550d | 13848 | mogi-guacu | SP | 2018-04-24 16:16:53 | 2018-04-24 19:04:19 | 1 | April | 2018 | 16:16:53 | 2018-4 | 2.240000 | 35 | SP | São Paulo | Sudeste | 13848.000000 | -22.358101 | -46.943397 | mogi guacu | SP |
| 1 | a035b83b3628decee6e3823924e0c10f | perfumaria | 53.000000 | 2235.000000 | 3.000000 | 1450.000000 | 20.000000 | 25.000000 | 20.000000 | b18cb761efbe70da4838435a349abd07 | 1 | 5670f4db5b62c43d542e1b2d56b0cf7c | 2018-07-06 19:31:03 | 268.380000 | 21.070000 | 3694 | sao paulo | SP | 1 | credit_card | 10 | 289.450000 | 0f0f85749cb82321d902e390a9c33694 | 4 | nan | nan | 2018-07-11 00:00:00 | 2018-07-15 21:32:04 | 9ff6d1a05ecfe2d907adac11a2e2bfe9 | delivered | 2018-07-04 18:39:28 | 2018-07-05 16:33:14 | 2018-07-05 13:21:00 | 2018-07-10 22:34:39 | 2018-07-27 00:00:00 | fc68ba9c34778e17224154c255b5656e | 84200 | jaguariaiva | PR | 2018-07-04 18:39:28 | 2018-07-05 16:33:14 | 2 | July | 2018 | 18:39:28 | 2018-7 | 10.000000 | 41 | PR | Paraná | Sul | 84200.000000 | -24.282197 | -49.735125 | jaguariaiva | PR |
| 2 | 091107484dd7172f5dcfed173e4a960e | perfumaria | 50.000000 | 260.000000 | 2.000000 | 183.000000 | 16.000000 | 8.000000 | 13.000000 | a7708ffa8966514c098d15e1abfa6417 | 1 | 5670f4db5b62c43d542e1b2d56b0cf7c | 2018-04-27 19:31:35 | 7.650000 | 10.550000 | 3694 | sao paulo | SP | 1 | credit_card | 2 | 57.750000 | 39489f8518181b82cb86ce264e1ccc55 | 5 | Agora sim | Recebido o pedido finalmente produto ótimo | 2018-04-26 00:00:00 | 2018-04-30 21:15:06 | 360782688ff472f1adbd47a85861751e | delivered | 2018-04-23 19:03:35 | 2018-04-24 19:17:45 | 2018-04-24 16:22:36 | 2018-04-25 23:02:39 | 2018-05-15 00:00:00 | c67329fd07872a68d6d6d1f1626a0760 | 3804 | sao paulo | SP | 2018-04-23 19:03:35 | 2018-04-24 19:17:45 | 0 | April | 2018 | 19:03:35 | 2018-4 | 1.664000 | 35 | SP | São Paulo | Sudeste | 3804.000000 | -23.505101 | -46.487342 | sao paulo | SP |
| 3 | e8b61f78db501ea0ca45677d1ca27de2 | brinquedos | 49.000000 | 120.000000 | 1.000000 | 400.000000 | 50.000000 | 10.000000 | 25.000000 | a7708ffa8966514c098d15e1abfa6417 | 2 | 3d871de0142ce09b7081e2b9d1733cb1 | 2018-04-27 19:31:35 | 29.000000 | 10.550000 | 13232 | campo limpo paulista | SP | 1 | credit_card | 2 | 57.750000 | 39489f8518181b82cb86ce264e1ccc55 | 5 | Agora sim | Recebido o pedido finalmente produto ótimo | 2018-04-26 00:00:00 | 2018-04-30 21:15:06 | 360782688ff472f1adbd47a85861751e | delivered | 2018-04-23 19:03:35 | 2018-04-24 19:17:45 | 2018-04-24 16:22:36 | 2018-04-25 23:02:39 | 2018-05-15 00:00:00 | c67329fd07872a68d6d6d1f1626a0760 | 3804 | sao paulo | SP | 2018-04-23 19:03:35 | 2018-04-24 19:17:45 | 0 | April | 2018 | 19:03:35 | 2018-4 | 12.500000 | 35 | SP | São Paulo | Sudeste | 3804.000000 | -23.505101 | -46.487342 | sao paulo | SP |
| 4 | ccac9976bafbf7e587bd2c29302e2314 | perfumaria | 53.000000 | 2235.000000 | 3.000000 | 1500.000000 | 20.000000 | 25.000000 | 20.000000 | 206d1a13596872a713dba14504fdf699 | 1 | 5670f4db5b62c43d542e1b2d56b0cf7c | 2018-08-01 16:35:17 | 268.380000 | 18.070000 | 3694 | sao paulo | SP | 1 | credit_card | 8 | 286.450000 | 36b616f738d1f617cc12aeb2c01705f3 | 3 | nan | nan | 2018-08-07 00:00:00 | 2018-08-07 22:56:38 | 58c5eb8e7844d1ecd85128e663f49e04 | delivered | 2018-07-29 20:50:22 | 2018-07-30 18:31:13 | 2018-08-01 15:24:00 | 2018-08-06 18:36:46 | 2018-08-15 00:00:00 | 1b6d2664b20a04a1fa9c900aedf29440 | 71920 | brasilia | DF | 2018-07-29 20:50:22 | 2018-07-30 18:31:13 | 6 | July | 2018 | 20:50:22 | 2018-7 | 10.000000 | 53 | DF | Distrito Federal | Centro-Oeste | 71920.000000 | -15.832732 | -48.035078 | brasilia | DF |
Total_rows,Total_col=df_orders_items.shape
Total_rows,Total_col
(117329, 56)
Keeping only the columns required for this particular plot:
df_regions_group = df_orders_items.groupby(by=['Order_month_year', 'nome_regiao'], as_index=False)
df_regions_group = df_regions_group.agg({'customer_id': 'count', 'price': 'sum'}).sort_values(by='Order_month_year')
df_regions_group.columns = ['month', 'region', 'order_count', 'order_amount']
df_regions_group.reset_index(drop=True, inplace=True)
df_regions_group
| month | region | order_count | order_amount | |
|---|---|---|---|---|
| 0 | 2016-10 | Centro-Oeste | 20 | 2435.85 |
| 1 | 2016-10 | Nordeste | 37 | 5796.65 |
| 2 | 2016-10 | Norte | 7 | 1127.30 |
| 3 | 2016-10 | Sudeste | 244 | 32298.08 |
| 4 | 2016-10 | Sul | 71 | 8614.59 |
| ... | ... | ... | ... | ... |
| 104 | 2018-8 | Norte | 85 | 14748.32 |
| 105 | 2018-8 | Sul | 998 | 118584.98 |
| 106 | 2018-8 | Centro-Oeste | 388 | 48628.74 |
| 107 | 2018-8 | Nordeste | 488 | 73239.31 |
| 108 | 2018-9 | Sudeste | 1 | 145.00 |
109 rows × 4 columns
fig = px.scatter(df_regions_group, x="month", y="order_amount", size='order_amount',
symbol="region",
color="region",
title='<b>Orders across different Region of Brazil</b>',
)
fig.update_layout(
title="Revenue generation across different Region of Brazil",
yaxis_title="Orders_Amount",
xaxis_title="Month of Year",
font=dict(family='Arial',
size=14,
)
)
fig.show()
The above graph shows how the amount customers pay for products varies across the different regions of Brazil. Each marker represents a region, and a larger marker means people in that region spent more at the store.
The story depicted by the graph can be summarized as follows:
Spending in the Sudeste region grew from a minimum in January 2017 to a maximum in January 2018, then leveled off around July 2018 and stayed roughly constant afterwards.
The Sul region follows: spending was minimal in January 2017, rose to around BRL 70K, and peaked at around BRL 1500K in November-December 2018, after which it stayed roughly constant.
There is only a minimal difference in spending between the Nordeste and Centro-Oeste regions.
The lowest spending comes from the Norte region, with mean spending around BRL 90K.
Having tracked how much people in each region of Brazil spend, it is also worth knowing how the number of orders varies, i.e. how many orders customers from each region place at the Olist e-commerce store per month.
fig, ax = plt.subplots(figsize=(18, 8))  # 18 in wide, 8 in tall
sns.set_style('darkgrid')
sns.lineplot(x='month', y='order_count', data=df_regions_group, hue='region',
             size='region', style='region', palette='magma', markers=['o'] * 5
             )
plt.xlabel("Month of Year", size=12)
plt.ylabel("No. of orders", size=12)
plt.title("Number of orders across Brazil's various regions", size=24)
The above graph depicts the number of orders coming from each region of Brazil across the months. The following points can be concluded:
The Sudeste region has the highest order count, at 65K. Sul has the second-highest number of orders at the Olist store. Centro-Oeste, Nordeste, and Norte differ only minimally from one another, and their order counts vary almost linearly. October-November 2017 registered the highest number of orders across all regions of Brazil.
Where were customers located in 2016, 2017, and 2018?
To obtain the spread of consumers across the various cities of Brazil, 30K rows have been sampled from the dataset pool for a quick estimate. We first filter by order_year for 2016, 2017, and 2018, and take the corresponding latitude and longitude from the dataframe.
lats = list(df_orders_items.query('order_year == 2017')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2017')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
loc = 'No. of customers in Year 2017 from different cities of Brazil'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map1 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)
map1.get_root().html.add_child(folium.Element(title_html))
map1.save('map-with-title.html')
map1
lats = list(df_orders_items.query('order_year == 2016')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2016')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
# Adding title to graphs
loc = 'No. of customers in Year 2016 from different cities of Brazil'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map2 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)
map2.get_root().html.add_child(folium.Element(title_html))
map2.save('map-order_year == 2016.html')
map2
lats = list(df_orders_items.query('order_year == 2018')['geolocation_lat'].dropna().values)[:30000]
longs = list(df_orders_items.query('order_year == 2018')['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
loc = 'No. of customers in Year 2018 from different cities of Brazil'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map3 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map3)
folium.TileLayer('Stamen Terrain').add_to(map3)
map3.get_root().html.add_child(folium.Element(title_html))
map3.save('map-withorder_year == 2018.html')
map3
Here the same geolocation data is aggregated and plotted as a density heat map over Brazil.
%%time
# Grouping geolocation data for plotting a heatmap
heat_data = df_orders_items.groupby(by=['geolocation_lat', 'geolocation_lng'], as_index=False).count().iloc[:, :3]
# Creating a mapa using folium
map1 = folium.Map(
location=[-15, -50],
zoom_start=4.0,
tiles='cartodbdark_matter'
)
# Plugin: HeatMap
HeatMap(
name='Mapa de Calor',
data=heat_data,
radius=10,
max_zoom=13
).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)
map1.save('map-withorder_geolocation.html')
map1
CPU times: user 467 ms, sys: 5.97 ms, total: 473 ms Wall time: 471 ms
In which regions of Brazil are orders undelivered, canceled, & shipped?
df_copy_order_canceled=df_orders_items[df_orders_items['order_status']=="canceled"]
lats = list(df_copy_order_canceled['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_canceled['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
loc = 'Regions of Brazil where orders are getting canceled'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map1 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map1)
folium.TileLayer('Stamen Terrain').add_to(map1)
map1.get_root().html.add_child(folium.Element(title_html))
map1.save('map-with-order_status_canceled.html')
map1
df_copy_order_canceled.nome_regiao.value_counts()
Sudeste         431
Sul              62
Nordeste         34
Centro-Oeste     22
Norte             4
Name: nome_regiao, dtype: int64
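Raw counts like these are easier to compare across regions as proportions. A minimal sketch of a per-region status breakdown with a normalized value_counts (real region names, invented statuses):

```python
import pandas as pd

toy = pd.DataFrame({
    "nome_regiao": ["Sudeste", "Sudeste", "Sudeste", "Sul"],
    "order_status": ["delivered", "canceled", "delivered", "delivered"],
})

# Fraction of each status within every region.
status_share = (toy.groupby("nome_regiao")["order_status"]
                   .value_counts(normalize=True))
```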
The conclusions that can be drawn from the above plot (for the 30K-row sample of the dataset) are:
Sudeste has the maximum number of canceled orders, at 431. Sul has 62 canceled orders. Norte has the fewest canceled orders, at 4.
df_copy_order_delivered=df_orders_items[df_orders_items['order_status']=="delivered"]
lats = list(df_copy_order_delivered['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_delivered['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
# Creating a mapa using folium
loc = 'Regions of Brazil where orders are getting delivered'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
map2 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)
map2.get_root().html.add_child(folium.Element(title_html))
map2.save('map-with-order_status_delivered.html')
map2
df_copy_order_shipped.nome_regiao.value_counts()
Sudeste         725
Nordeste        196
Centro-Oeste    112
Sul             104
Norte            30
Name: nome_regiao, dtype: int64
The products still in the "shipped" state, possibly because of logistics constraints, are shown in the map above:
The Sudeste region of Brazil has the maximum number of shipped products, Nordeste the second-highest, and Norte the fewest products still in the shipped stage.
df_copy_order_shipped=df_orders_items[df_orders_items['order_status']=="shipped"]
lats = list(df_copy_order_shipped['geolocation_lat'].dropna().values)[:30000]
longs = list(df_copy_order_shipped['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
loc = 'Regions of Brazil where orders are getting shipped'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map3 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map3)
folium.TileLayer('Stamen Terrain').add_to(map3)
map3.get_root().html.add_child(folium.Element(title_html))
map3.save('map-with-order_status_shipped.html')
map3
df_copy_order_delivered.nome_regiao.value_counts()
Sudeste         78993
Sul             16484
Nordeste        10596
Centro-Oeste     6706
Norte            2080
Name: nome_regiao, dtype: int64
The above map shows the different locations in Brazil where orders are delivered (for the 30K-row sample of the dataset).
The Sudeste region has the maximum number of delivered products in the sample, Sul the second-highest, and Norte the fewest.
As state economies vary from neighbour to neighbour, it is interesting to know in which states people seek credit-based financial support for their spending, as well as their capability to repay the amount in minimal time.
So we integrate the information of the two columns payment_type and nome_regiao:
This question is answered by grabbing the information from the payment-mode column payment_type of the dataframe df_copy.
Reviewing the statistics and other measures of payment_type makes clear that, among the many customers who shopped at the Olist store, most preferred to pay their bills by credit card over the other payment methods.
df_credit=df_orders_items[df_orders_items['payment_type']=="credit_card"]
lats = list(df_credit['geolocation_lat'].dropna().values)[:30000]
longs = list(df_credit['geolocation_lng'].dropna().values)[:30000]
locations = list(zip(lats, longs))
loc = 'Regions of Brazil with the maximum number of credit-card users'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
# Creating a mapa using folium
map2 = folium.Map(location=[-15, -50], zoom_start=4.0)
# Plugin: FastMarkerCluster
FastMarkerCluster(data=locations).add_to(map2)
folium.TileLayer('Stamen Terrain').add_to(map2)
map2.get_root().html.add_child(folium.Element(title_html))
map2.save('map-with-payment_type_credit_card.html')
map2
df_credit.nome_regiao.value_counts()
Sudeste         59848
Sul             11930
Nordeste         8173
Centro-Oeste     5042
Norte            1527
Name: nome_regiao, dtype: int64
The Sudeste region has the largest number of customers who pay their Olist bills by credit card, Sul the second-largest, and Norte the fewest.
After obtaining the dataset from Kaggle and following the data schema, a final merged version of the CSV files was built. Open-ended analysis was then performed on individual columns, and two or more columns were combined where needed to draw deeper insights from the available dataset:
- cama_mesa_banho is the category made most available on the Olist store for its customers, while eletronicos has one of the smallest counts among the top 15 categories.
- cama_mesa_banho is the category most often purchased via credit card.
- People of the Sudeste region use credit cards the most for payment at the Olist store, while people of the Norte region use them the least.
- Sao Paulo has the maximum number of sellers, at about 28K; the remaining cities have fewer.
- cama_mesa_banho has the highest purchase count, whereas fashion_bolsas_e_acessorios has the lowest, meaning fewer people purchase it from the Olist store.
- cama_mesa_banho, utilidades_domesticas, and moveis_decoracao have the highest counts among above-mean-volume products, which implies bulkier logistics requirements for their sellers.
- Most customers belong to the Sudeste region, very few to the Norte region, and the other regions have roughly equal numbers of customers.
- People of the Sudeste region purchase the largest number of products from the Olist store, followed by Sul; Centro-Oeste purchases fewer, and Norte a negligible number. This parallels the observation above that fewer customers belong to the Norte region, and hence fewer purchases come from there.
- Sudeste has the maximum number of canceled orders as well as the maximum number of shipped orders; it also has the maximum number delivered, which is consistent, since shipped orders eventually split into delivered and canceled ones.
Apart from the open-ended questions addressed by exploring individual columns and combining different columns, it is also worth extracting more minute details such as
%%time
jovian.commit()
[jovian] Updating notebook "jhagautamkumar362/eda-project-working" on https://jovian.ai [jovian] Committed successfully! https://jovian.ai/jhagautamkumar362/eda-project-working CPU times: user 81.5 ms, sys: 33.5 ms, total: 115 ms Wall time: 2min 24s
'https://jovian.ai/jhagautamkumar362/eda-project-working'